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1 Abstract 

In microarray experiments, it is often of interest to identify genes which have a pre- 
specified gene expression profile with respect to time. Methods available in the litera- 
ture are, however, typically not stringent enough in identifying such genes, particularly 
when the profile requires equivalence of gene expression levels at certain time points. In 
this paper, the authors introduce a new methodology, called gene profiling, that uses 
simultaneous differential and equivalent gene expression level testing to rank genes ac- 
cording to a pre-specified gene expression profile. Gene profiling treats the vector of 
true gene expression levels as a linear combination of appropriate vectors, i.e., vectors 
that give the required criteria for the profile. This gene-profile model is fitted to the 
data and the resultant parameter estimates are summarized in a single test statistic 
that is then used to rank the genes. The theoretical underpinnings of gene profiling 
(equivalence testing, intersection- union tests) are discussed in this paper, and the gene 
profiling methodology is applied to our motivating stem cell experiment. 

Keywords: Gene expression; Gene profiling; Linear model; Microarray; Pluripotency; 
Stem cell; Time course experiment. 
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2 Introduction 



Microarray technology enables researchers to exami ne the expression levels for many 



thou sands of genes simultaneously (see, for example, iNguyen et all 120021 : ISmyth et al 
2003). Increasingly, informatio n on gene expression is used to infer cell protein lev- 



els and thus cellular be h aviou r rtNguyen et all 120021 : ISmvth et all 1200.4 lAhnert et al 



20061 : iMcLachlan et all l200d ). A further major area of interest is in investigating 



changes in gene express i on levels over time in a population of cells (Dudoi t et al 



Bar-Jos eph et al. 



2005; 



Brown et al 



2002 



20031: iGlonek and Solomoil 120041 : iTai and Speed! . 120051 : lErnst et al 



20061 ; Ahnert et all I2OO6I ) and this is the subject of the present pa- 



per. We refer to the gene expression levels over time as a gene expression profile, or 
profile for short. 

Several methods of analysing gene expression profiles fall into the class of tech- 
niques known as unsupervised learning methods. These methods seek to group genes 
into a number of classes based upon their observed profiles. Some of the m ethodologies 



discu ssed in the recent micr oarray literature are hierarchical classification (lEisen et al 



1998), self-organizing maps (jTamayo et aull999h. the if -means algor i thm flT avazoi e et al. 



1999J), multivariate Gaussian mixtures ( Ghosh and Chinnaivanl . 120021 ; lYeung et al 



2001 



and mixtures of linear mixed models (ICeleux et all 120051 ) . A related problem that arises 
in applications of microarray time course experiments is to specify, in advance, a gene 
expression profile of interest and then to identify the genes with matching expression 
profiles. However, unsupervised methods do not address this problem and various al- 
ternative approaches have been proposed. 

" (|2002h 



One such me t hod is Pareto optimization, proposed by iFleury et al. 



and 



Hero and Fleurvi (|2004l ). in which a set of functions, each measuring the association 



of a gene to a pre-specified profile, is chosen. Genes found to be Pareto-optimal with 
respect to these criteria are identified as matching the pre-specified profile. The main 
disadvantage with Pareto optimization is that some genes will be selected as Pareto- 
optimal genes whilst only matching the pre-specified profile for a subset of the profile's 
criteria. 

In an unpublished paper, lLonnstedt eloZI (|2003l ) describe a different method for 
ranking genes, based on the inner product between the vector of observed log ratios and 
a pre-specified profile. This method works well for some profiles, but did not provide 
useful outcomes in our application. 

Gene profiling is a new approach developed by the present authors, which aims to 
identify genes that match a pre-specified gene expression profile, with greater specificity 
than the previously described approaches. Gene profiling entails treating the vector of 
true gene expression levels for each gene as a linear combination of linearly independent 
vectors chosen to represent the pre-specified profile. The gene-profile model is fitted 
to the observed log ratios, and the genes are ranked by a single test statistic which 
incorporates simultaneous differential and equivalent gene expression testing. 

In Section [31 our motivation for gene profiling is presented. Section 13.11 sets out the 
details of the experimental design for a pluripotent (stem cell) time course experiment 
which provided our initial motivation for the ensuing methodological development. The 
theoretical underpinnings of gene profiling are described in Section 01 which entails a 
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review of equivalence testing (Section 14.21) and intersection- union tests (Section 14. 3|) . 
The gene profiling methodology is set out in Section 14.41 and the results obtained from 
our application to a stem cell experiment are presented in Section [5] In Section [6J some 
further work and how to apply the methods in limma are briefly discussed. 



3 Motivation: pluripotency 

Our motivating example is a stem cell experiment originally conducted by the Rath- 
jen laboratory, formerly of the University of Adelaide. The aim of the ex periment was to 



ident i fy genes associated with pluri potency in mice embryonic stem cells (jD'Amour and Gagd . 



20031 : iRamalho-Santos et all 120021 ) . Early stem cells have the potential to differentiate 



into any body cell: a property known as pluripotency. This ability is present in mice 
stem cells up to and including day three. After this the stem cells become multipotent: 
they still have the ability to differentiate into different types of cells, but now a limited 
number. For example, haemopoetic stem cells can differentiate into blood cells but 
not nerve cells. As pluripotency is restricted to the early stem cells, day 3 or earlier, 
genes that have high expression levels in cells up to day 3, but low, or monotonically 
decreasing expression levels thereafter, are likely to be associated with the biochemical 
pathways involved in the pluripotency ability of these cells (personal communication, 
Dr Chris Wilkinson). 



3.1 Pluripotency example: experimental design 

Stem cells were isolated from the early embryo and grown in culture dishes. The cells 
were allowed to replicate and grow over the medium in the dish. Once the cells had 
crowded the plate, they were removed, separated and plated onto new plates. This 
cycle of growth and re-plating is called a passage. The Rathjen laboratory isolated mice 
embryonic stem cells, and for this experiment, used cells from passages 21, 22, 23 and 
24. The cells were stimulated to differentiate into multipotent cells, and on days 0, 3, 6 
and 9 after stimulation, samples were taken and the messenger RNA (mRNA) obtained. 

The gene expressions of the 16 samples of stem cell mRNA for the four days (0, 3, 6, 
9) and four passages, were measured. Within each passage, five comparisons were made, 
namely, day to day 3, day to day 9, day 3 to day 6, day 3 to day 9, and day 6 to day 
9. The experimental design in terms of the true gene expression levels, /x (see Section 
is summarized in Figure [U while the experimental design in terms of the gene profiling 
parameters, 7 (Section^]), is compared to the design in terms of the true expression levels 
in Table [H The clone library used in the experiment was the Compugen 22,000 mouse 
oligonucleotide library (http://www.microarray.adelaide.edu.au/libraries/). In total, 20 
arrays were hybridized on two-colour long-oligonucleotide microarrays, with five arrays 
within each passage. The five arrays consisted of the five comparisons detailed above. 
In this analysis, the stem cells from each passage were treated as independent biological 
replicates. 
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1^9 T M6 

Figure 1: Microarray comparisons made within each passage for the pluripotency stem 
cell experiment. Each arrow represents two arrays, one for each passage (passage 21/22 
continuous arrow, passage 23/24 dashed arrow), with the arrow head pointing to the 
sample labeled with cy5, and the sample at the arrow tail labeled with cy3. Day 0, 3, 
6, and 9 are represented by no, /X3, and /j,g respectively. 
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in terms of //'s 


in terms of 7's 





^0 


7o + 7i + 72 + ^73 


3 




7o + 7i + 72 - ^73 


6 




7o + 72 


9 


1*9 


7o 



Table 1: Parameterization of stem cell experiment in terms of absolute mean gene ex- 
pressions (/ij, i = 0,3, 6, 9) and in terms of the gene profile coefficients (7$, % = 0, 1, 2, 3). 
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4 Gene profiling methodology 



4.1 Development of method for stem cell experiment 

The expression criteria over time required for a pluripotent gene are: 

• equal gene expression levels for days and 3, 

• higher gene expression levels for days and 3 compared to day 9, and 

• the gene expression level for day 6 to lie between the gene expression levels for 
day and day 3, and the gene expression level for day 9. 

The requisite (hypothetical) profile is illustrated in Figure 

Consider the vector of true mean gene expression levels, /x = ^3, ftj ^9)') where 
= 0,3,6,9 is the mean gene expression level on day i as shown in Figure [TJ Since 
this is a vector in M 4 , it can be expressed as the linear combination of four linearly 
independent vectors. The first step in gene profiling is to choose vectors that represent 
the criteria for pluripotency. In the present example, this corresponds to 
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(4.1) 



With this choice of model, it follows that 70 = ^9, 71 = (/io + ^3)/% — fJ-6, 72 = — ^9; 
and 73 = fio — Therefore, the pluripotent profile requires that 71 > 0,72 > 0,73 = 0, 
but does not constrain 70. To find genes that achieve these criteria requires tests for 
equivalence as well as (simultaneous) tests for differential gene expression. In the next 
section, equivalence testing is discussed. We then describe how to simultaneously test 
for both differential and equivalent gene expression in a time course experiment. 



4.2 Statistical Equivalence 

To determine pluripotency, it is necessary to demonstrate that 73 = 0. Conventional hy- 
pothesis tes ting is not app licable to this situation, but the equivalence testing approach 
discussed in lWellekl rt2002h is. 

If X is a random vector whose probability distribution depends on a real-valued 
parameter 9, then to test if is equivalent to zero, a neighbourhood around zero is 
constructed and the following null and alternative hypotheses are tested: 



H : 
H a : 



> e, e > 0, 
< e. 



(4.2) 



The neighbourhood defined by e is the maximum that the parameter can vary and still 
be considered equivalent to zero. This neighbourhood is necess ary to ensure that the 
power of the statistical test is greater than its significance level (jWellekl . |2002| ) . 
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Figure 2: The pre-specified gene expression profile for pluripotent genes. For each day, 
the log ratio with respect to day is plotted. 

For the gene profiling model, the parameter e is taken to be the largest that a gene's 
mean log ratio can vary around zero and not be of "significant" gene expression, ac- 
cording to biologists. In practice, a working understanding of equivalent gene expression 
should be decided upon in advance in consultation with biologists. Unfortunately how- 
ever, relatively little is known about gene-specific variation per se: information that 
could of course be used to decide on an appropriate value of e. We discuss potentially 
suitable choices of e in Section [5j but for the present we will assume an appropriate e 
to be available. Using such a value of e, the simplest and most common way to test the 
hypotheses in (|4,2|) is via Confidence Interval Inclusion (CII). 

Consider the null and alternative hypotheses specified in (|4.2p . We calculate a 
confidence interval, R a (X), from the observed data X, where 



L a (X) and U a {X) are random variables, such that 

P {9 G (L a (X), oo)) = P (6 G (-oo, U a {X))) = 1 - a. 
We reject the null hypothesis in favour of equivalence if and only if 

R a (X) C (-e,e), 
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R a (X) = (L a (X),U a (X)); 



(4.3) 



i.e., the confidence interval is contained entirely within the interval (-e, e). This is an 
a-level test. 

The equivalence formulation can be used to test that 73 in equation (|4.ip is equivalent 
to zero with the following null and alternative hypotheses: 

H : |73| > e vs. H a : | 73 | < e. (4.4) 

For example, to test the hypotheses in (|4.4|) . the confidence interval 

( 7 3-t*SE(73),73 + t*SE(7 3 )), 

is calculated and 73 is concluded to be equivalent to zero if this confidence interval lies 
within (-e,e). In this confidence interval, t* is chosen such that P(T > t*) = a, where 
T has a t-distribution with the appropriate degrees of freedom for 73. 

Confidence interval inclusion can also be used to (separately) test whether 71 and 
72 are significantly positive. The null and alternative composite hypotheses for 71 are 

H : 71 < vs. H a :-fi> 0, 

and for 72 are 

Hq : 72 < vs. H a : 72 > 0. 

For an a-level test here, a one-sided (1 — a) 100% confidence interval for 71 is calcu- 
lated: 



( 7 i-i*SE( 7l ),oo), 

and if this interval is contained in (0,oo), 71 is concluded to be significantly positive. 
Similarly for 72. 

These methods allow testing of each criterion separately, but for pluripotency all 
three criteria need to be valid simultaneously. The authors' method to simultaneously 
test for both equivalence of parameters to zero and significant departures of parameters 
from zero is described in the next section. 



4.3 Intersection-Union test 

The test for each criterion discussed in Section 14.21 can be incorporated simultaneously 
into a single null and a single alternative hypothesis as follows: 

#0 : (71 < 0) U (72 < 0) U (Its I > <0 , e > 0, (4.5) 
versus H a : ( 7l > 0) f] ( 72 > 0) f] (| 73 | < e) . (4.6) 



The hypotheses in (|4.5p and (|4,6p represent an intersection-union test (IUT) (|Bergerl . 



19821 ) . To review, in an IUT, the null hypothesis is expressed as a union, 



Ho:9e\jQ y , 

7er 



7 



where 7 is a subset of the parameter space indexed by 7. The rejection region R of 
this IUT is of the form R = P|^ gr i? 7 , where i? 7 is the rejection region for a test of 
H07 : ^ € 7 versus iii 7 : 9 € 0£. This is an a-level test, where a = sup 7gr a 7 and a 7 
is the size of the test , with rejection region R 7 . 

Thus for each = 1,2,3, in the null hypothesis statement (14. 5|) . a test of size 
oci is found, and the overall IUT will be of level sup a^. Using the confidence interval 
inclusion method discussed in the previous section to test each ji separately, each test 
being of level a, gives an overall a-level test. 

Our main aim is to rank the genes in our motivating example according to their 
match with the pluripotent profile. The testing methodology described can be modified 
to give a quantitative measure of how closely each gene matches the desired profile. 
Considering each gene separately, for each parameter, = 1,2,3, confidence interval 
inclusion is used to test the associated null hypothesis. Rather than using a fixed 
significance level, the smallest significance level, a^, for each = 1,2,3 respectively, 
is found, such that the null hypothesis is rejected. The supremum of cti,i = 1,2,3 is 
used as the test statistic to rank the genes. In fact, in the stem cell experiment, rather 
than calculate oti for each 7^, i = 1, 2, 3, the width of the largest confidence interval, Ui, 
for each 7^ that was contained within the rejection region was used. The infimum, U, 
of the Ui was then used to rank the genes (it should be noted that this is equivalent to 
ranking based on supc^). 

To further elucidate the method, consider FigureO This illustrates a two-dimensional 
example where the criteria are 71 = and 72 > 0. The rejection region is indicated 
by the rectangular shaded region. The point (71,72) is the estimate of (71,72). The 
distance to the nearest boundary of the rejection region is calculated in standard errors 
of the estimate and this distance is used to rank the genes, with larger values indicative 
of association with pluripotency. Genes whose profiles do not lie within the rejection 
region are excluded from the ranking. 

The above development leads to the general methodology for determining pluripo- 
tency described in the next section. 



Rejection region 




Figure 3: Illustration: for each gene, the distance from (71,72) to the nearest boundary 
of the rejection region is used to rank the genes. 
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4.4 Gene profiling for pluripotency 

The scanned imag es for each hybridized microarray slide were an alysed using SPOT 



(Yan g et al 



2001 



t o give the cy3 and cy5 intensities for each gene (jYang et all , 12001 



Adams and Bischoi . 19941 ). The data were then normalized by within-array print-tip 



loess , and t he gene profile model was fitt e d to t he normalized data using limma (jSmythl . 
2005l i in R (|R Development Core Teanl l200d ). For each gene, the model parameter 
estimates and standard errors obtained by limma were used to calculate the U statistic 
(see below) using C code embedded in R code. The genes were then ranked using the 
U statistic. 

The vector of observed log ratios M was expressed as a linear model of the true 
gene expression levels fi as follows: 

M = X*fj, + E, 

where X* is the design matrix representing the mRNA comparisons made on each array, 
and E is assumed to be distributed as A*2o(0, a 2 I). Using equation (|4.ip to substitute 
for fx, gives 



M 



X 



/I 1 
1 1 

1 

\i o 

X-y + E. 



1/2 \ 
-1/2 




\ 


( 70 \ 




7i 




72 


/ 


V 73 / 



+ E 



In the stem cell experiment, the microarray platform used was two-colour long oligonu- 
cleotide which, as for cDNA microarrays, measures relative gene expression, but not 
absolute gene expression levels. Therefore, the overall gene expression level, 70, could 
not be estimated and was removed from the model by changing the parameter vector 
to (71,72,73) and removing the first column of X. 

Estimates of 7 were calculated via least squares, and the estimate of a 2 was ob- 
tained using the empirical Bayes method utilized in limma; this gives a robust posterior 
estimate of a 2 based on a prior which "borrows" information from the observed variance 
of all the genes on the array. 

For each gene, three tests statistics, U%, U2 and ^3 were calculated as follows: 



Ui 



71 



SE( 7 i)' 



72 



SE( 72 



173 1 



SE( 73 



where SE(7j) is the ith diagonal element of the square matrix: s^J (X'X)' 1 , and s is the 
posterior estimate of a. The minimum of Ui,i = 1,2,3, is used to rank the genes, for 
which genes with larger values of U are more likely to be associated with pluripotency. 

Genes whose estimate (71,7*2,73) of (71, 72, 73) did not lie within the rejection region, 
i.e. those genes for which at least one Ui,i = 1,2,3 was negative, were excluded from 
the ranking. 
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5 Application: determining genes associated with pluripo- 
tency using gene profiling 



The model (|4.ip was fitted to the stem cell data with e = 1. In addition, the test 

statistics were changed to test for 72 > 1.5, i.e., U% = <j£T,V 5 . . The value of 1.5 was 

011472) 

chosen to ensure a large difference between the gene expression levels on days 0, 3 and 
6 compared with the gene expression level on day 9. 

The ranked genes are given in Table [21 and the fitted profiles for these 15 genes are 
shown in Figure [H Figure H] shows the fitted log ratios with respect to day for the 
four time points: day 0, day 3, day 6, and day 9. Therefore, all of the profiles will 
pass through zero on day 0. The profiles demonstrate the required trajectory: equal 
expression for day and day 3, higher gene expression levels for days and 3 compared 
to day 9, and the gene expression level for day 6 lying between the gene expression levels 
for days and 3 and that for day 9. 
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Table 2: The ranked genes from fitting pluripotent profile (|4.1p to the stem cell data. 



T he top-ranked gen e, Oct4, is well-known to be associated with pluripotency (IRodda et al 



20051 : lLoh et all . l200d ) and would therefore be expected to appear amongst the top- 
ranked genes for pluripotency in this experiment. Other genes of note in the ranked 
genes in Table [2] are Utfl (rank 2) which is associated with undifferentiated embry- 
onic cell transcription (jNishimoto et all. 12005!) . and N anog (rank 11) which is central to 



embryonic stem cell plu ripotency (IWang et al. . 20061 ). 



The recent article by Wang et al. ( 20061 ) isolated proteins associa ted with the protei n 
Nanog and thus with pluripotency. Of the 38 proteins discussed in Wang et al. (j2006l ) . 
Oct4 and Nanog appeared in our list of ranked genes using model (|4.1|) : ranks 1 and 11 
respectively. The remaining proteins were not in the ranked genes as the profiles of the 
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associated mRNAs are not consistent with profile (|4.ip . 




Sensitivity analysis: As stressed previously, the choice of the neighbourhood around 
zero assumed for equivalence (i.e., e) should be decided upon in consultation with bi- 
ologists. However, this is problematic since biologists still have relatively little explicit 
knowledge of gene-wise expression variability, and therefore, what precisely and quan- 
titatively may represent equivalence of gene expression. 

To investigate the potential effects of altering the neighbourhood defined by e, the 
primary analysis was repeated assuming, respectively, values of e = 0.5,1,1.5, and 2. 
In Figure [5] the profiles for the genes which have observed profiles that lie within the 
rejection region are plotted for each choice of equivalence neighbourhood. As the equiv- 
alence neighbourhood width (e) increases, more genes have profiles that lie within the 
rejection region, but there is greater variation between the gene expression levels for 
day and day 3. Nevertheless, gene profiling in this application has been demonstrated 
to be reasonably robust. For e=0.5, 1 and 1.5, Oct4 was ranked as the top gene, while 
for e=2, it had only dropped to rank 2. 

Profiling f or Sox2: It is well k nown that the gene Sox2 is commonly associated with 
pluripotency ( Rodda et al . 20051 ). but it was not in the ranked genes using the gene 



expression profile (|4.ip : the fitted gene expression profile for Sox2 is very different from 
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(a) (b) 




Figure 5: Fitted log ratios with respect to day for the ranked genes with (a) e = 0.5, 
(b) e = 1, (c) e = 1.5, and (d) e = 2. 



the pluripotent profile used in the analysis. The criteria for the profile of Sox2 are: 
higher gene expression level on day compared to the gene expression levels for days 6 
and 9; equivalent gene expression levels on days 6 and 9; and the gene expression level 
for day 3 to lie between the gene expression level for day and the levels for days 6 and 
9. Gene profiling can be used to rank the genes according to these alternative criteria. 
An appropriate model for Sox2 is: 

/111 \ 
10 10 

1 o o I 7 ' 
V 1 o o -I J 

in which 70 is unrestrained, 71 > 0, 72 > 0, and 73 is equivalent to zero. This 
model was fitted to the data and the ranked genes are shown in Figure EJ The ranked 
genes were Cptla, 1200014E20Rik, 2210409E12Fiik, Sox-2, Np-1, Birc5, 5730419I09Rik, 
MGI: 1922156, retSDR3, and clone RP21-505L19 on chromosome 5. Sox2 was ranked at 
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position 4. Of note is retSDR3. This gene has the required form with a larger difference 
in gene expression between day and days 6 and 9 compared to the other genes. Even 
with this large difference, retSDR3 is low down in the ranking at rank 9. This low 
ranking is because retSDR3 has a large gene expression variance (0.181) compared to 
the other ranked genes (average gene expression variance of 0.054). This illustrates that 
if two genes have the same coefficient values, gene profiling will rank lower the gene 
which has the larger variance and thus more uncertainty about its true profile. 




Day 

Figure 6: Fitted log ratios with respect to day for the top 10 ranked genes for the 
Sox2 profile. 



6 Discussion of further work 

In general, gene profiles of interest to molecular biologists often consist of two types of 
criteria: equal gene expression at different time points and differential gene expression 
at different time points. Gene profiling provides a straightforward methodology to filter 
genes which satisfy these two types of criteria simultaneously. We believe that this has 
not been accomplished using previously available techniques. By simultaneously testing 
for all criteria, gene profiling effectively filters out and excludes genes that are only 
partially consistent with the required profile. We now touch on some areas requiring 
further work. 
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Choice of e: As noted in Section 14.31 to test for a parameter being equal to zero, 
a neighbourhood of width e is defined. This neighbourhood is the amount that the 
parameter could vary and still be considered equivalent to zero. In this paper, the 
choice of e was based on plotting profiles for the various choices of e, and choosing 
the best e to give the required pre-specified profiles. Ideally, the choice of e should be 
based on consultation with biologists, to the extent that such knowledge is available. 
One would anticipate that such requisite knowledge will gradually accrue over time, as 
microarray and other new genomics technologies are more widely applied in molecular 
biology and genetics. 

Invariance of parameterization: Another area requiring further research is the in- 
variance (or otherwise) of reparameterization. In his (2002) book, Wellek notes: 

"... in contrast to the corresponding conventional testing problems with 
the common boundary of null and alternative hypothesis [sic] being given by 
zero, equivalence problems remain generally not invariant under redefinitions 
of the main parameter." 

To illustrate this point, consider the problem of finding marker genes for day 3 in the 
stem cell experiment. The criteria for such genes are: high gene expression level on 
day 3, as well as equal and low gene expression levels on day 0, day 6 and day 9. 
The requisite profile is illustrated in Figure [7J Examination of the profile reveals three 
possible models: 
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where 7 = (70,71,72,73)' with 70 unrestrained, 71 significantly positive, 72 equivalent 
to zero, and 73 equivalent to zero. 

The three models may not necessarily give the same results. This is because equiv- 
alence is not transitive, i.e., if /xo is equivalent to fj,e, and pbQ is equivalent to fj,$, it is 
not necessarily true that is equivalent to /Ug. This is because equivalence is defined 
in a neighbourhood and so a "drift" resulting in hq an d /ig being too far apart to be 
considered equivalent can occur. Methods to impose invariance are currently under in- 
vestigation by the authors. Although this is an interesting area of research, invariance of 
reparameterization does not limit the use of gene profiling. There are many pre-existing 
statistical tests, e.g., the Wald test, that are not invariant under reparameterization. In 
many cases, the research hypotheses will dictate the optimal model to use. 

Applying gene profiling using limma: Gene profiling is easily implemented by fitting 
the model to the data using limma and then calculating the U statistics. The calculation 
of the U statistics was written in C to decrease the run time, but is easy to implement 
in R. 

To conclude, gene profiling introduces a flexible method to select genes for a pre- 
specified time-course profile. Gene profiling is straightforward to implement in practice, 
requiring only small modifications to the R package limma, and can be used to select for 
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most profiles of interest to biologists. The application of gene profiling in this article 
has been to two-colour microarrays, but it could readi ly be modified for use for other 
microarrays platforms, such as Affymetrix GeneChip ( Lockhart et al. . 1996), and for 
other technologies where it is required to rank observations by correspondence with a 
pre-specified profile. 
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Figure 7: Hypothetical gene expression profile for a day 3 marker gene. 
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